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Abstract 

Background: G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the 
targets of almost two-thirds of the marketed drugs. The 3 D structures of GPCRs are largely unavailable; however, 
a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel 
receptors, it is therefore very valuable to develop a computational method to accurately predict GPCRs from the 
protein primary sequences. 

Results: We propose a new method called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497 
sequence-derived features. The principal component analysis is first employed to reduce the dimension of the 
feature space to 32. Then, the resulting 32-dimensional feature vectors are fed into a simple yet powerful 
classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level 
determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, then it will be further 
predicted into certain family, subfamily, sub-subfamily and subtype by the classifiers at the second, third, fourth, and 
fifth levels, respectively. To train the classifiers applied at five levels, a non-redundant dataset is carefully 
constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife 
tests on this training dataset show that the overall accuracies of PCA-GPCR at five levels (from the first to the fifth) 
can achieve up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a 
dataset of 1238 GPCRs at the second level, and on another two datasets of 167 and 566 GPCRs respectively at the 
fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing 
methods to be compared. 

Conclusions: The comprehensive set of 1497 features is believed to be capable of capturing information about 
amino acid composition, sequence order as well as various physicochemical properties of proteins. Therefore, high 
accuracies are achieved when predicting GPCRs at all the five levels with our proposed method. 



Background been estimated that almost two-thirds of drugs on the 
The structure of a G-protein-coupled receptor (GPCR) market interact with GPCRs [3], which indicates that 
generally comprises seven a-helical transmembrane GPCRs are pharmacologically important. Therefore, both 
domains, an extracellular N-terminus, and an intracellular academic and industrial researchers are very interested in 
C-terminus [1]. GPCRs constitute one of the largest family the studies on GPCRs to understand their structures and 
of membrane proteins, and their main function is to trans- functions. Unfortunately, the 3 D protein structures of 
duce extracellular signals into intracellular reactions. GPCRs are largely unavailable [4], except for the GPCR 
Therefore, they play a key role in diverse physiological family bovine rhodopsin. Although some advanced bio- 
processes such as neurotransmission, secretion, cellular technologies such as NMR allow to detect the 3 D protein 
differentiation, cellular metabolism, and so forth [2]. It has structures, their experiments are generally very time- 

consuming and costly. In contrast, a large number of 
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method to predict GPCRs from the protein primary 
sequences. 

Based on their binding ligand types, GPCRs are often 
classified into different groups, some of which are 
further divided into subgroups, sub-subgroups, etc. The 
GPCRDB database [1,6] is one of the most popular data- 
base for GPCRs, which organizes GPCRs using a hier- 
archical structure. As in [7,8], we call each layer of this 
hierarchical structure a level. The top layer is then 
referred to as the second level (One more layer will be 
added on the top of the hierarchical structure later), and 
the second layer is referred to as the third level, etc. 
According to the latest version of the GPCRDB database 
(Version 9.9.1, September 2009), GPCRs in the second 
level are classified into five families or classes (In the 
previous versions of the GPCRDB database, e.g., June 
2006 release, GPCRs are classified into six families in 
this level); that is, (1) Class A Rhodopsin like, (2) Class 
B Secretin like, (3) Class C Metabotropic glutamatel 
pheromone, (4) Vomeronasal receptors (V1R 8cV3R), and 
(5) Taste receptors T2R. For the first four families above, 
each is further divided into subfamilies located at 
the third level. Furthermore, located on the fourth and 
fifth levels of the hierarchical structure are the sub- 
subfamilies and subtypes, respectively. On the other 
hand, given a new protein, the first step is to determine 
whether it is a GPCR or a non-GPCR. Therefore, we 



add one more level on the top of the hierarchical struc- 
ture of the above classification system. It is referred to 
as the first level. The complete hierarchical structure of 
five levels is illustrated in Figure 1. 

In this paper we will look into the following classifica- 
tion problem, which is referred to as a five-level classifi- 
cation problem. Given a protein sequence, we need to 
determine whether it is a GPCR or a non-GPCR. If it is 
predicted into a GPCR, we need to further determine 
which family, subfamily, sub-subfamily, and subtype it 
belongs to. To tackle this problem, a set of distinct clas- 
sifiers is generally needed for each level as depicted in 
Figure 1. In the literature, many computational methods 
have been proposed to predict GPCRs. However, to our 
best knowledge, there are no methods that can deal 
with the five-level problem completely, (i.e., allow to 
make predictions at all the five levels). For example, the 
methods presented in [9-12] predict GPCRs just at a 
single level (the second, third or fourth level), and the 
methods in [13] predict GPCRs only at the third and 
fourth levels. The prediction methods in [8] and [7] 
instead considered three and four levels, respectively. 

Today's academic and industrial researchers are both 
interested in the functional roles of GPCRs at the finest 
subtype level. This is mainly because each subtype 
demonstrates its own characteristic ligand binding prop- 
erty, coupling partners of trimeric G-proteins, and 
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Figure 1 The hierarchical structure for GPCRs. The organization of GPCR sequences in the GPCRDB database does not include the first level 
in this figure. We add it in this study because we performed prediction at this level. 
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interaction partners of oligomerization [14]. Therefore, 
discrimination of functions of a GPCR subtype from the 
others (i.e., prediction of GPCRs at the fifth level as 
shown in Figure 1) becomes very important in the effort 
to decipher GPCRs. However, we can expect that it is a 
challenging task that shall not be easier than the predic- 
tion of GPCRs at any of the first four levels. Fortunately, 
more and more GPCR sequences are now being accu- 
mulated into the GPCRDB database, which makes it 
possible to accurately predict GPCRs at all the five 
levels. This is the main goal of our present study. 

A lot of related work has been done previously. In 
general, there are two important components in a classi- 
fication task - one is feature extraction and the other is 
a classification algorithm. Feature extraction means how 
to extract features from protein sequences so that each 
protein is represented as a fixed-length numerical vec- 
tor. Various methods have been proposed to extract 
information from protein sequence in the past decades 
(See eg., [15-19]). The commonly-used feature extrac- 
tion methods are based on amino acid composition 
[9-11] and dipeptide composition [7,12,13,20,21], and 
more complicated ones include Chou's pseudo amino 
acid composition [15], the cellular automaton image 
approach [16], profile hidden Markov models [22], fast 
Fourier transform [23], wavelet-based time series analy- 
sis [24], and Fisher Score Vectors [25]. Once protein 
sequences are represented by numerical vectors, any 
general-purpose classification algorithms can be used for 
classification, for instance, covariant discriminant 
[9-11,16], nearest neighbor [7], bagging classification 
tree [13], and support vector machines [12,20,21,23-25]. 

In this paper, we focus on predicting GPCRs at the 
five levels. Five groups of descriptors are used to extract 
information from the amino acid sequences. These five 
groups are (1) amino acid composition and dipeptide 
composition, (2) autocorrelation descriptors, (3) global 
descriptors, (4) sequence-order descriptors, and (5) 
Chou's pseudo amino acid composition descriptors. 
These descriptors reflect various physicochemical prop- 
erties of proteins and have been adopted to predict 
many other protein attributes, such as protein subcellu- 
lar localization [19,26], outer membrane protein [27], 
nuclear receptors [28], and protein structural classes 
[17,18]. By combining these descriptors, a comprehen- 
sive set of 1497 features are calculated for each amino 
acid sequence. By applying the principal component 
analysis on a dataset, we then reduce them to a set of 
32 features that could retain as much of the data varia- 
bility as possible. 

Finally, a simple yet powerful algorithm called inti- 
mate sorting is employed to predict GPCRs, and the 
experimental tests on the benchmark datasets show that 
the classifications can be improved. Jackknife test shows 



that the overall accuracies of the proposed method at 
the first, second, third, fourth, and fifth levels achieve 
up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respec- 
tively. Comparisons with several existing methods show 
that the proposed method achieves higher prediction 
performance consistently. 

Results and Discussion 

Predicting GPCR at five levels 

For simplicity, we call the proposed method PCA-GPCR. 
PCA-GPCR preforms the prediction at five levels, and its 
flowchart structure is depicted in Figure 2. By the first- 
level classifier a new protein sequence is predicted to be 
either a GPCR or a non-GPCR. If it is predicted to be a 
GPCR, it will be further classified into one of the four 
families, which is done by the second-level classifier. The 
third-level classifiers hence determine which subfamily 
the protein belongs to. For some subfamilies (see 
Additional file 1), the fourth-level classifiers are used to 
determine the sub-subfamily of the protein. Finally, the 
fifth-level classifiers determine the subtypes of the pro- 
tein, if any (see Additional file 1). We carried out the 
experiments on the collection of datasets GDFL (Please 
see the Methods section for the details of datasets). 
Jackknife tests show that the overall accuracies of PCA- 
GPCR are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34% for 
the five levels, respectively. The details of experimental 
results are presented in the Additional file 1. It is com- 
monly believed that, the smaller number of training 
sequences, the less reliable a classifier to be trained. 
Therefore, it is not surprising to see that the prediction 
accuracies are higher at the first and second levels and 
relatively lower at the third and fourth levels. On the 
other hand, to filter out high-homology sequences, we 
used CD-HIT with a less stringent threshold (0.9) for the 
fifth level than the one for any other levels, which results 
in a larger number of training sequences for the fifth 
level. This might partly explain why the accuracy 
achieved for the fifth level (subtype) is higher than those 
of the second, third and fourth levels. For the conveni- 
ence of public use, a web server was already developed, 
which is freely available at http://wwwl.spms.ntu.edu.sg/ 
~chenxin/PCA_GPCR. 

Comparison with BLAST-based classification 

The most straightforward method for predicting GPCRs 
might be based on homology search by sequence align- 
ment tools such as BLAST and PSI-BLAST [29]. 
A given GPCR sequence is hence predicted into the 
class to which its most similar GPCR sequence belongs. 
However, as the pairwise sequence similarities get lower, 
such an alignment-based method would rarely yield 
satisfactory predictions. For instance, when applied to 
the dataset GDFL for the prediction at the first level, the 
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Figure 2 The structure of PCA-GPCR. For the name of the families, subfamilies, sub-subfamilies, and subtypes, please refer to the Additional 
file 1. The fourth and fifth levels are only applicable for some subfamilies and subtypes, which are also listed in the Additional file 1. 



BLAST-based method achieved the overall accuracy of 
74.58%, which is 14.92% lower than that from PCA- 
GPCR. Note that PCA-GPCR is instead an alignment- 
free method. The above experimental results therefore 
show that an alignment-free method is very promising 
in the high accurate prediction of GPCR classes. 

Comparison with previous methods 

In order to demonstrate the superior performance of 
PCA-GPCR, we make comparisons with a number of 
previous methods. Depending on the predictive capabil- 
ity of previous methods, the comparisons are made at a 
single level and at the first two levels, as follows. 
Comparison at a single level 

Because many previous methods predicted GPCR at a sin- 
gle level [9-12], we also predict GPCR at just one level in 
order to compare with them fairly. Three benchmark data- 
sets that contain a proportion of high-homology sequence 
pairs, D167, D566 and D1238, are used here (Please see 
Methods section for the details of these datasets). The 
first two datasets comprise GPCRs from the fourth level, 
and the last one is composed of GPCRs from the second 



level. The resulting prediction accuracies for these datasets 
are listed in Table 1. We can see that the overall accura- 
cies for three datasets are all above 97%. To be specific, 
the overall accuracies of 98.2%, 97.88%, and 99.76% are 
achieved for the datasets D167, D566, and D1238, respec- 
tively. They are slightly higher than the accuracies 
reported in Refs. [7,9-13,21]. Indeed, the prediction 
accuracies for individual families or sub-subfamilies are all 
very high and, in some cases, have reached 100% or nearly 
100%. 

Because the dataset D167 has been widely used to test 
various methods, it is adopted here for further detailed 
comparisons with the other five methods [7,10,12,13,21]. 
The experimental results are presented in Table 2. It is 
evident from the table that our method achieved the 
highest overall prediction accuracy. Our method per- 
forms better than any other tested methods in the pre- 
dictions of the GPCR sub-subfamilies except for the 
sub-subfamily Serotonin. 
Comparison with GPCR-CA at the first two levels 
We further compare our method with GPCR-CA [16] 
on the dataset D365, which comprises GPCRs from the 
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Table 1 The number of proteins in four datasets and the 
corresponding prediction accuracies 

Dataset Family/sub-subfamily Tot(/) c(i) ACC(%) 



Acetylcho ine 


q 1 


3 1 
3 I 


1 nn 


Adrenoceptor 


44 


44 


100 


Dopamine 


38 


36 


94.74 


Serotonin 


£A 

34 




OQ 1 C 

yo. i j 


Overall 


1 67 


164 


98.2 


Adrenoceptor 


DO 


03 


OQ A Q 

yo.4o 


Chemokine 


92 


90 


97.83 


Dopamine 


A Q 


/in 


oq m 


Neuropeptide 


31 


30 


96.77 


uiractory 


QA 

o4 


QA 

o4 


1 nn 


Rhodopsin 


1 QQ 


1 on 
I oU 


OQ Q^ 


Serotonin 


67 


65 


97.01 


Overall 


566 


554 


97.88 


Rhodopsin-like 


1103 


1102 


99.91 


Secretin-like 


84 


83 


98.81 


Metabotrophic/glutamate/pheromone 


51 


50 


98.04 


Overall 


1238 


1235 


99.76 


Rhodopsin-like 


232 


222 


95.69 


Secretin-like 


39 


34 


87.18 


Metabotrophic/glutamate/pheromone 


-1-1 


39 


88.64 


Fungal pheromone 


23 


22 


95.65 


CAMP receptor 


10 


10 


100 


Frizzled/smoothened 


17 


11 


64.71 


Overall 


365 


338 


92.6 



Tot(0 is the number of sequences observed in class /', c(() is the number of 
correctly predicted sequences of class /, and ACC is the prediction accuracy. 

second level. Unlike the datasets tested in the above 
subsection, D365 contains almost no high-homology 
sequence pairs. Note that the GPCR-CA is able to pre- 
dict GPCRs at the first two levels. 

The prediction accuracies of both GPCR-CA and 
PCA-GPCR at the first and second levels are listed in 
Table 3 and Table 4, respectively. At the first level, to 
distinguish GPCRs from non-GPCRs, our method 
achieves the overall accuracy of 95.21%, which is 3.57% 
higher than that of GPCR-CA. At the second level, the 
overall accuracy of our method improves over GPCR- 
CA by 9.04%. Meanwhile, according to the prediction 
accuracies of individual families, our method performs 



Table 3 Comparison with GPCR-CA in identifying the 
GPCRs and non-GPCRs 



Protein type 


GPCR-CA [16] 


This paper 


GPCR 


92.33 


96.99 


Non-GPCR 


90.96 


93.42 


Overall 


91.64 


95.21 



The results of GPCR-CA are directly taken from the Ref. [16]. 



much better than GPCR-CA except for the rhodopsin- 
like family. It is also noticeable that a substantial 
improvement of 86.95% (= 95.65% -8.70%) has been 
made for the prediction of the fungal pheromone family 
(partly due to the small size of protein sequences in this 
family, as shown in Table 1). 

GPCR-CA extracts 24 features, including 20 features 
from amino acid composition and four features from 
cellular automaton image [16]. While the last four fea- 
tures were reported to be able to reveal the protein's 
overall sequence patterns, only four features might not 
suffice to reveal overall sequence patterns completely. 
On the contrary, our method explores the amino acid 
sequences comprehensively to gain as much information 
from the protein primary sequences as possible. Both 
the amino acid composition and the dipeptide composi- 
tion are utilized in our method and, moreover, the 
important sequence-order information and a variety of 
physicochemical properties of amino acids are carefully 
explored as well. We believe that it is this comprehen- 
sive set of features that lead our method to a higher pre- 
diction accuracy. 

Conclusions 

In this paper, we have proposed a new method called 
PCA-GPCR to predict GPCRs at five levels. In this 
method, a comprehensive set of 1497 sequence-derived 
features are generated from five groups of descriptors - 
that is, amino acid composition and dipeptide composi- 
tion, autocorrelation descriptors, global descriptors, 
sequence-order descriptors, and Chou's pseudo amino 
acid composition descriptors. These features are able 
to capture the information about the amino acid com- 
position, sequence order as well as various physico- 
chemical properties of proteins. Because of the high 



Table 2 Comparison with other methods at the fourth level based on the D167 dataset 


Reference 


Acetylcholine 


Adrenoceptor 


Dopamine 


Serotonin 


Overall 


[10] 


67.74 


88.64 


81.58 


88.89 


83.23 


[13] 


90.3 


86.4 


78.9 


79.6 


83.2 


[12] 


93.6 


100 


92.1 


98.2 


96.4 


[7] 


93.3 


100 


94.7 


100 


97.6 


[21] 


96.7 


100 


92.1 


100 


97.6 


This paper 


100 


100 


94.74 


98.15 


98.2 



The results of the other methods are taken directly from the corresponding references. 
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Table 4 Comparison with GPCR-CA for the dataset D365 
in predicting GPCR families 



Family 


GPCR-CA 


This paper 


Rhodopsin-like 


96.55 


95.69 


Secretin-like 


74.36 


87.18 


Metabotrophic/glutamate/pheromone 


81.82 


88.64 


Fungal pheromone 


8.70 


95.65 


CAMP receptor 


60 


100 


Frizzled/smoothened 


47.06 


64.71 


Overall 


83.56 


92.60 


The results of GPCR-CA are directly taken from the Ref. [16]. 



dimensionality of the feature space, the principal com- 
ponent analysis is hence used to reduce the dimension 
from 1497 to 32. The resulting 32-dimensional feature 
vectors are finally fed into a simple yet powerful inti- 
mate sorting algorithm for the prediction of GPCRs at 
five levels. 

By evaluating on the datasets constructed from the lat- 
est version of the GPCRDB database, the overall accura- 
cies of our method from the first level to the fifth level 
are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. 
We further test and compare our method with several 
other methods based on four benchmark datasets widely 
used in the literature. At the second level, for a dataset 
containing 1238 GPCRs, the overall accuracy of our 
method reaches 99.76%. At the fourth level, for two dif- 
ferent datasets that contain 167 and 566 GPCRs, the 
overall accuracies of our method reach up to 98.2% and 
97.88%, respectively. They are all higher than those of the 
other methods under comparison. At the first two levels, 
we further test our method on a low-homology dataset 
(with only a few sequence pairs of more than 40% 
sequence identity). The overall accuracies thus achieved 
at the first level and second level are 95.21%, 92.6%, 
respectively, which are 3.57% and 9.04% higher than 
those of the method GPCR-CA. 

We conclude that the high prediction accuracy of the 
proposed method is attributed to the comprehensive set 
of features that we constructed from five groups of 
descriptors. It is anticipated that our method could con- 
tribute more to the characterization of novel proteins 
and gain new insights into their functions, thereby facili- 
tating drug discovery. A web server that predicts GPCRs 
at five levels with our proposed method is freely available 
at http://wwwl.spms.ntu.edu.sg/~chenxin/PCA_GPCR. 

Methods 

Datasets 

We construct a collection of non-redundant datasets from 
the latest release of the GPCRDB database (Version 9.9.1, 
September 2009) [6] to evaluate and train the classifiers 
for the GPCRs prediction. As mentioned in the 



Background section, the sequences in the GPCRDB data- 
base are organized in four levels: family or class, subfamily, 
sub-subfamily, and subtype. We download the GPCR 
sequences from the GPCRDB database and then filter out 
the high-homology sequences using the program CD-HIT 
[30]. In order to ensure that there are enough sequences 
to train the classifiers, we apply different thresholds in 
CD-HIT for sequences at different levels. They are 0.4, 
0.7, 0.8, and 0.9 for the family, subfamily, sub-subfamily, 
and subtype levels, respectively. After filtering, only 
families (subfamilies, sub-subfamilies, and subtypes) with 
more than 10 sequences are retained for training classi- 
fiers. Because the fifth family {Taste receptors T2R) has no 
subfamily and there are only 14 sequences remaining after 
filtering by CD-HIT, it is therefore ignored in subsequent 
analysis. At the end, we obtained 1589, 4772, 4924, and 
2741 GPCRs at the family, subfamily, sub-subfamily and 
subtype levels, respectively. The name of families, subfami- 
lies, sub-subfamilies, and subtype, together with the num- 
ber of GPCR proteins retained at each level are listed in 
the Additional file 1. 

The GPCR protein sequences retained at the family 
level are used to construct a positive dataset for training 
and evaluation. A negative dataset of non-GPCRs is then 
constructed in almost the same way as in Ref. [25], 
except that the latest version of ASTRAL SCOP (Version 
1.75) [31] is used. First, we download the sequences that 
have less than 40% identity to each other (i.e., the file 
with the name "seq.75;item = seqs;cut = 40"). Then, 
remove those sequences of length less than 30, and those 
having identity above 40% using CD-HIT. Finally, a total 
of 10325 sequences remain, from which 1589 sequences 
are randomly selected to form a negative dataset. Because 
these selected proteins are organized into five levels, for 
the sake of convenience, we call them the datasets GDFL 
(GPCR Datasets in Five Levels). They are available at the 
web server provided in this paper. 

In addition, in order to perform comparison with 
other existing methods directly, four benchmark datasets 
from previous studies are experimented in this study as 
well. For the sake of simplicity, they are referred to as 
D167, D566, D1238 and D365, respectively. We know 
that all of them were constructed based on the older 
version of the GPCRDB database. The proteins in the 
dataset D167 [10] (belonging to the fourth level) are 
classified into four sub-subfamilies: (1) acetylcholine, (2) 
adrenoceptor, (3) dopamine, and (4) serotonin. The data- 
set DS66 [11] (belonging to the fourth level) instead 
comprises proteins in seven sub-subfamilies: (1) adreno- 
ceptor, (2) chemokine, (3) dopamine, (4) neuropeptide, 
(5) olfactory type, (6) rhodopsin, and (7) serotonin. The 
dataset D1238 [9] (belonging to the second level) com- 
prises proteins from three families: (1) rhodopsin like, 
(2) secretin like, and (3) metabotrophic/glutamate/ 
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pheromone. The last dataset D365 [16] (belonging to the 
second level) comprises proteins in the six families: (1) 
rhodopsin-like, (2) secretin-like, (3) metabotrophic/gluta- 
mate/pheromone; (4) fungal pheromone, (5) cAMP recep- 
tor and (6) frizzled/smoothened family. The numbers of 
proteins in the above four datasets are given in Table 1. 
Furthermore, 365 non-GPCR sequences are taken from 
the Swiss-Prot database to serve as a negative dataset 
against D36S [16]. 

The sequence homology level is an important factor that 
affects the effectiveness of a classification method. There- 
fore, it is worthwhile to take a look at the sequence simi- 
larity levels of proteins in these datasets before performing 
any evaluation test. For simplicity, we analyze the similar- 
ity level of the whole dataset rather than the subsets in the 
dataset. Chou and Elrod [9-11] reported that all the recep- 
tor sequences in the aforementioned datasets were gener- 
ally lower than 40%, according to their definition of the 
average sequence identity percentage between two protein 
sequences. Here, we run a protein sequence clustering 
program called CD-HIT [30] on each dataset with the 
varying thresholds of sequence identity. For example, if a 
threshold of 0.9 is used, the proteins having pairwise resi- 
due identities of 90% or above would be placed into a 
same cluster. In general, the fewer resulting clusters imply 
the higher overall sequence similarities. The test results 
are shown in Table 5, where the proteins are clustered 
with the thresholds of 0.9, 0.8, 0.7, 0.6, 0.5 and 0.4, respec- 
tively. In particular, 100 clusters are obtained for 167 pro- 
teins in the dataset D167 with the threshold of 0.9. It 
indicates that there do exist high-homology protein pairs, 
but they only take up a small proportion of the total num- 
ber (i.e., 12861 = 167 x 166/2) of distinct protein pairs. 
The use of the threshold of 0.4 further reduces the num- 
ber of clusters to 30, which could suggest that the average 
sequence identity of proteins is quite low. However, to 
avoid the overestimation of prediction accuracy, it would 
be better if those high-homology sequences can be filtered 
out with CD-HIT. For instance, the dataset D365 does not 



Table 5 The CD-HIT clustering results for the four 
benchmark datasets 



r 




Dataset 






D167 


0566 


D1238 


D365 


1.0 


167 


566 


1238 


365 


0.9 


100 


346 


777 


361 


0.8 


73 


226 


540 


361 


0.7 


61 


169 


421 


361 


0.6 


52 


142 


358 


359 


0.5 


38 


106 


281 


357 


0.4 


30 


69 


207 


356 



/denotes the threshold for the sequence identity percentage. The row of 
7= 1.0 gives the total number of proteins in each dataset. 



contain any protein pairs having > 40% pairwise sequence 
identity except in the E-cAMP receptor family, which con- 
tains too few (only 10) GPCRs to apply filtering. 

Physicochemical properties 

In order to capture as much information of protein 
sequences as possible, a variety of physicochemical 
properties [32] are used in the procedure of feature 
extraction. These physicochemical properties are listed 
in Table 6, of which the first eighteen are used to mea- 
sure the physicochemical properties of individual amino 
acids and the last two to measure the physicochemical 
distances between two amino acids. 

Sequence-derived features 

As mentioned in the introduction, amino acid composi- 
tion was widely used to transform GPCR sequences into 
20-dimension numerical vectors [9-11]. However, the 
sequence order information would be completely lost. In 
order to address this issue, dipeptide composition was pro- 
posed to represent GPCR sequences by 400-dimension 
vectors, which captures local-order information and has 
been reported to improve classifications [7,12,13,20,21]. 
Recently, GPCR-CA [16] utilized the conception of Chou's 
pseudo amino acid composition [33] to represent each pro- 
tein sequence by 24 features. The first 20 features corre- 
spond to the amino acid composition and the remaining 
four features are calculated from a so-called cellular auto- 
mation image. These four features were shown capable of 
reflecting a protein's overall sequence pattern. Inspired by 
this work, we seek a new set of features that can compre- 
hend as much information as possible from GPCR 
sequences. To this end, we investigate the following five 
groups of features, where the parameters are set to the 
same values as in [34]. 

Amino acid composition (AAC) and dipeptide composition 
(DC) 

Amino acid composition is defined as the occurrence 
frequencies of 20 amino acids in a protein sequence. 
That is, 

/a(0 = ^ (1) 

where each i = 1, 2, ... , 20 corresponds to a distinct 
amino acid and n A (i) is the number of amino acid i 
occurring in the protein sequence of length L. 

Similarly, dipeptide composition is defined as the 
occurrence frequencies of the 400 dipeptides (i.e., 400 
amino acid pairs). That is, 
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Table 6 The physicochemical properties of the amino acids and distances between two amino acids 



Order Physicochemical property Range of property Reference 



1 


1— h/H rnnhnKiritv t:r~^lpt; 
nyuiujji luuicny iLdicb 


[-1 14 1 81] 


L->zj 


n 
z 


Average flexibility indices 




["391 


j 


Po arizability parameter 


rn n zlhqi 
Lu, u.'M-uyj 


r^9i 


/] 


Prpp pnprriu nf ^rtliitinn in watpr 
i i cc ci ici y y ui ouiuuui i n i vvdici 


[-2 24 491] 


L3ZJ 


5 


RpciHi ip arrpccihlp ci irfsrp 3t"P3 in trinpntirlp 

nCJlUUC O^CCoolUIC oUllo^C al cd III LIIUCUlllJC 


rvc: ncri 
L' ->/ ZJJJ 


R9i 




RpciHi ip \/r\ i imp 


T^n ^ 13S41 
I.JU.J, I JJ.HJ 


L->zj 


7 


jLcML (JdldMICLCl 


rn 1 n?i 

|_U, 1 .UZJ 


r^9i 


Q 

o 


Relative mutability 


[ 1 O, 1 DH] 


r^9i 


g 


Hvrl rnnhnhirifv 
i \y\j i upi luuiLi iy 


[-2.53, 1 .38] 


RTI 


1 n 


nyui upi iiiiLiLy 


L j.^v 




ii 


Side-chain mass 


[1, 130] 


[33] 


12 


Normalized van der Waals volume 


[0, 8.08] 


[34] 


13 


Polarity 


[4.9, 13.0] 


[34] 


14 


Polarizability 


[0, 0.409] 


[34] 


15 


Charge 


Positive, Neutral, Negative 


[34] 


16 


Secondary structure 


Helix, Strand, Coil 


[34] 


17 


Solvent accessibility 


Buried, Exposed, Intermediate 


[34] 


18 


Relative hydrophobicity 


Polar, Neutral, Hydrophobic 


[34] 


19 


Grantham chemical distance 


[0, 215] 


[34] 


20 


Schneider-Wrede physicochemical distance 


[0, 1] 


[19] 



where each i = 1, 2, 400 corresponds to one of the 
400 dipeptides and n D (i) is the number of dipeptide i 
occurring in the sequence. 
Autocorrelation descriptors (AD) 

We use three autocorrelation descriptors - normalized 
Moreau-Broto autocorrelation descriptors, Moran auto- 
correlation descriptors and Geary autocorrelation 
descriptors. They are all defined based on the value dis- 
tributions of the first eight physicochemical properties 
of amino acids along a protein sequence (see Table 6). 
The measurement values of these properties are first 
standardized before we proceed to calculate the three 
autocorrelation descriptors. The standardization is per- 
formed as follows. 

m = MtFo i f = 1/2 2a (3 ) 

CT 

where P 0 (i) are the property value of the amino acid i, 

20 I 20 Z 

P o=^£^oW,and a= U£(P 0 (i)-P 0 ) 2 . 

Normalized Moreau-Broto autocorrelation descrip- 
tors are defined as: 

NMBA(d)= hABA ^-, d = l,2 30, (4) 

L—d 

L-d 

where MBA(d) = P[ R i) p i R i+d)' R i and R i + d are 

i=i 

the amino acids at position i and i + d along the protein 



sequence, respectively. As mentioned earlier, we use the 
same parameter values as in [34], so the maximum 
value of d is 30. 
Moran autocorrelation descriptors are defined as: 

MA(d) = j , d = l,2,...,30, (5) 

i=l 

L 

where P = i^P(P;) is the average value of the 

i=i 

property of interest along the sequence. 
Geary autocorrelation descriptors are defined as: 

GA(d) = ^4 , d = l,2 30. (6) 

For each autocorrelation descriptor, we obtain 240 
(= 30 x 8) features. In total, 720 (= 240 x 3) features 
will be obtained to describe a protein sequence. 
Global descriptors (GD) 

These descriptors were first proposed by Dubchak et al. 
[35] to predict protein folding classes, and later applied to 
predict human Pol II promoter sequences [36]. They are 
constructed as follows. Firstly, given each of the following 
seven amino acid properties: normalized van der Waals 
volume, polarity, polarizability, charge, secondary 
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structure, solvent accessibility and relative hydrophobicity 
(i.e., properties 12-18 listed in Table 6), the 20 amino acids 
are divided into three groups according to their property 
values. Then, for a given amino acid sequence, we may 
obtain a new sequence of three symbols, each correspond- 
ing to one group of amino acids. Finally, three groups of 
quantities are defined on the new sequence; that is, com- 
position {Comp), transition {Tran) and distribution (Dist), 
as demonstrated below. 

For the sake of simplicity, suppose that a sequence is 
made of only two letters (A and B). Comp is defined as 
the occurrence frequency of each letter in the sequence. 
For example, we have a sequence BABBABABBABBAA- 
BABABBAAABBABABA, in which there are 14 As and 
16 Bs. Therefore, the occurrence frequencies of A and B 
are 14/(14 + 16) x 100.00 = 46.67 and 16/(14 + 16) x 
100.00 = 53.33, respectively. Tran is used to represent 
the occurrence frequency of pairs AB or BA. In the above 
sequence, there are 21 transitions from one letter to 
another, so Tran is computed as (21/29) x 100.00 = 
72.14. On the other hand, Dist calculates the relative 
positions of the first, 25%, 50%, 75% and 100% of the 
total amount of a particular letter in the sequence. In the 
above sequence, for example, the first, 25%, 50%, 75% 
and 100% of the total amount of the letter B are located 
at the first, 6th, 12th, 20th and 29th positions, respec- 
tively. The quantities Dist for the letter B are hence 1/30 
x 100.00 = 3.33, 6/30 x 100.00 = 20.00, 12/30 x 100.00 = 
40.00, 20/30 x 100.00 = 66.67 and 29/30 x 100.00 = 
96.67. Similarly, we can find the Dist values for the letter 
A; they are 6.67, 23.33, 53.33, 73.33 and 100.00. At the 
end, the global descriptors of the above sequence become 

(Comp;Tran;Dist) = (46.67, 53.33; 72.14; 6.67, 23.33, 
53.33, 73.33, 100.00, 3.33, 20.00, 40.00, 66.67, 96.67) 

Suppose there are n distinct symbols in a sequence, 
then the number of features in Comp, Tran, and Dist 

are ^"J, and 5 x n, respectively. Recall that the 20 

amino acids are divided into three groups by each 
amino acid property, which leads to a new sequence 
of three symbols (« = 3). Following the similar proce- 
dure demonstrated above, we will obtain 2i|=^j+^j + 5x 3 j 

features to describe the new sequence (of three sym- 
bols). Combining all the features to be extracted based 
on the seven amino acid properties, we will obtain a 
total of 147 (= 21 x 7) features for each input protein 
sequence from the global descriptors. 
Sequence-order descriptors (SD) 

In order to derive sequence-order descriptors, we rely 
on two distance measures for amino acid pairs. One is 
called the Grantham chemical distance matrix [34], and 
the other called the Schneider-Wrede physicochemical 



distance matrix [19]. Then, the jth-rank sequence-order- 
coupling number is defined as: 



rU) = ^m,R i+} )) 2 , j = 1,2 30, 



(7) 



where d{R t , Rj +/ ) is one of the above distances 
between the two amino acids R, and 7?, + ; located at 
position i and i + /', respectively. 

The quasi-sequence-order descriptors are defined as: 



QSO[i) ■■ 



/a(0 



;=1 j=l 

ft) ■ 

~20 30 

X/ A (j) + «5>U) 



(1 < i < 20) 



, (21<i<50) 



(8) 



where m is a weighting factor (default m = 0.1). 

We end up with 60 (= 30 x 2) sequence-order-cou- 
pling numbers and 100 (= 50 x 2) quasi-sequence-order 
descriptors. In total, there are 160 features extracted 
from the sequence-order descriptors. 
Chou'is pseudo amino acid composition descriptors 
(PseAAC) 

This set of features were originally developed by Chou 
[33] and have been used widely to predict various attri- 
butes of proteins, such as outer membrane protein [27], 
nuclear receptors [28], and protein structural classes 
[17,18]. The Chou's pseudo amino acid composition 
descriptors are defined similarly as the quasi-sequence- 
order descriptors. The difference lies in the coupling 
number r (/'), which is modified to: 



L-d 

0{d) = j^ d ^Q{RuRi+dl d = l,2 30, 

!=1 



(9) 



where 0 (/?,, 7?, + 2i is the <ith-tier correlation factor that 
reflects the sequence order correlation between all the most 
contiguous residues along a protein chain. It is denned as: 



®(Ri,R i+ti ) = \^[H k (Ri) - H k {R 1+d )] 2 , 



(10) 



where H^Rj), H 2 (Ri) and H 3 (Rj) are the hydrophobicity, 
hydrophilicity, and side-chain mass of amino acid, respec- 
tively [33]. Their original values are standardized in the 
same way as we have done in the definition of autocorrela- 
tion descriptors (i.e., eq. (3)). Finally, the Chou'is pseudo 
amino acid composition descriptors are defined as: 
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PseAAC(i) - 



-, (l<i<20) 



m ■ 0(i) 



(ID 



30 -< (21<i<50) 

£/ A 0) + «£e(d) 

j=l d=l 



where co is a weighting factor (default co = 0.1). It will 
generate 50 features from the Chou'is pseudo amino 
acid composition descriptors. 

In summary, a comprehensive set of 1497 features, 
which measure the protein sequences from different 
aspects, will be generated from the above five descrip- 
tors. The number of features in each group of descrip- 
tors is listed in Table 7. These features are used to 
represent every protein sequence, and may be directly 
fed into a classification algorithm. Note that, however, 
there are some correlations or redundancies among 
these features such as the first twenty features in the 
fourth and fifth groups of features. On the other hand, 
the dimension of the features is too large, which might 
make it difficult to work with many machine learning 
algorithms for classification. Therefore, it is necessary to 
reduce the dimension. In this study, we adopt one of 
the most popular and powerful techniques, namely, 
principal component analysis, for the purpose of dimen- 
sionality reduction. 

Principal component analysis 

Principal Component Analysis (PCA) is a classical statis- 
tical method which is still widely used in modern data 
analysis. PCA involves a mathematical procedure that 
transforms a large number of (possibly) correlated vari- 
ables into a smaller number of uncorrected variables, 
called principal components (PCs), that retain as much 
variability of the data as possible [37]. Given a data 
matrix denoted by X = (X lt X 2 , X p ), where X t is a col- 
umn vector of size n which is equal to the number of 



Table 7 The number of features in each group of 
descriptors 



Order Name 



Number of 
features 



(i) 


Amino acid composition (AAC) and dipeptide 
composition (DC) 


420 


(ii) 


Autocorrelation descriptors (AD) 


720 


(iii) 


Global descriptors (GD) 


147 


(iv) 


Sequence-order descriptors (SD) 


160 


(V) 


Chou'is pseudo amino acid composition 
descriptors (PseAAC) 


50 


(vi) 


All features 


1497 



proteins of interest and p denotes the number of protein 
sequence features, a typical PCA is performed as follows. 
First, we shall standardize every X, by 



X j — X j 



i = \,2,...,p, 



(12) 



where X { and Var(X,) are the mean and variance of 
the vector components of X h respectively. Then, the 
covariance matrix of Y = (Y u Y 2 , Y p ) is obtained as 



Cov(Y) = 



1 

n-\ 



Y T Y. 



(13) 



For the covariance matrix Cov(Y), we find all its eigenva- 
lues Ai > A 2 > ... > X p and the corresponding eigenvectors, 
Ei, E 2 , E p . Note that each E t = (£,, i, E h 2 , -, P ) T is a 
column vector of size p and E lt E 2 E p are linearly uncor- 
rected according to the basic knowledge of linear algebra. 
Finally, we construct the f-th PC PC(i) as the linear combi- 
nation of Y lt Y 2 , Yp with the coefficients being the ele- 
ments of the /-th eigenvector i.e., 



pc(i) = ]T 



p. Y ■ 



i = \,2,...,p. 



(14) 



We can see that each PC(i) is a column vector with 
size n and the j-th element in PC(i) represents the i-th 
PC value of protein /'. Thereafter, a total of p uncorre- 
cted PCs are obtained. 

In order to reduce the dimension of the feature space, 
only the first m PCs are used to represent each protein 
sequence (m < p). It is generally hard to determine the 
optimal value of m. In this study, we aim to find a value 
of m that could make the overall prediction accuracy of 
GPCRs as high as possible, which we will further discuss 
later. 

Intimate sorting algorithm 

Many classification algorithms in the literature have 
been used to predict GPCRs, for instance, covariant dis- 
criminant [9-11,16], nearest neighbor [7], bagging classi- 
fication tree [13], and support vector machines 
[12,20,21,23-25]. In this study, we use a simple yet 
powerful algorithm called intimate sorting [26]. This 
algorithm is easy to implement and does not need to set 
any parameters as some other algorithms (e.g., support 
vector machines). 

Suppose that a training set consists of N proteins {P lt 
P 2 , Pat}, each of which P, is a A-dimension vector, 
P/ = (Pi, i» Pi, 2> — > Pi> A) T . The GPCR classes of these 
proteins are already known, and each protein belongs to 
exactly one of the \i classes. The intimate sorting 
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algorithm aims to place a query protein P = (p x , P2> •••> 
px) into one of the pi classes based on the information 
from the N proteins in the training set. To this end, a 
measure of similarity score between P and P, is defined 
as 



0(P,P,): 



PP, 



P Pi 



1, 2, 



(15) 



where p P i = l> J Ml p ll = ^|>, 2 ■ When P = P„ it can be 
easily seen that 0(P, P,) = 1, suggesting that they are 
most likely to belong to a same class. In general, we 
have -1 < 0(P, P,) < 1. The higher the ®(P, P,) value, 
the more likely two proteins belong to a same class. 
Among the N proteins in the training set, the one with 
the highest score with the query protein P is picked out, 
which we denote by P/ c , k e [1, N]. If there is a tie, we 
would randomly select one of them. In the final step, 
the intimate sorting algorithm simply assigns P into the 
same GPCR class as P^. 

Prediction assessment 

The jackknife test is a rigorous and objective statistical test 
that can always yield a unique result for a given test data- 
set [38] . Therefore, it is often used to examine the power 
of a new classifier. In this paper, we also use it to evaluate 
our method, where proteins are singled out from the data- 
set one by one as a testing protein and the classifier is 
trained by the remaining proteins. In this sense, the jack- 
knife test is also called the leave-one-out test. The predic- 
tion accuracies (ACC) and overall accuracy {OACC) are 
then measured by the following formulae: 



ACC{i) 



Tot(i) 



i = 1, 2, . 



(16) 



OACC 



£c(0 

i 

X Tot « 



i = l,2,...,pi, 



(17) 



where Tot(i) is the total number of sequences in class 
i, C{i) the number of correctly predicted sequences of 
class i, and pi the total number of classes under consid- 
eration. Note that this prediction assessment method 
was already adopted in several previous studies, e.g., 
[7,15,16,24]. 

Selection of m 

As we mentioned earlier, the number of PCs in PCA, 
i.e., m, remains to be determined. Here, we choose its 
value by aiming to achieve the overall prediction 



accuracy as high as possible. To this end, we use the 
dataset D365 to compute the overall prediction accura- 
cies OACCs of GPCR families for varying values of m. 
When m ranges from 1 to 80, OACCs thus obtained are 
plotted in Figure 3. We found that the highest accuracy 
(92.6%) is achieved with m = 32. Based on this observa- 
tion, we chose m = 32 in our experiments. 




Figure 3 Selection of m. The overall prediction accuracies of GPCR 
families for the D365 dataset obtained by varying the number m of 
principle components. The highest overall accuracy is achieved 
when m = 32, which is marked by the dotted lines. 




GD SD PseAAC 



lllHl 



Figure 4 Contribution of features. The meanings of the notations 
AAC, DC, AD, GD, SD, and PseAAC can be found in Table 7. The 
divisions of these six subsets are marked by vertical blue lines. 
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Contribution of features 

Inspired by the PCA-based feature selection method 
described in [37,39], we use the following procedure to 
assess the contributions of the 1497 features to predic- 
tion accuracy. Recall that in the previous principle com- 
ponent analysis on the dataset D365 we obtained 32 
eigenvectors E l , E 2 , E 32> and each eigenvector com- 
prises 1497 components. Let us denote £, = (£y), where 1 
< i < 32 and 1 < / < 1497. To find the i-th PC in PCA, E i} 
is used to weight the /-th feature. In this sense, the value 
E t j can be viewed as the weight of contributions that the 
/-th feature makes to the i-th PC. To combine the contri- 



butions to all the PCs, we may compute Wj = l\ E 



Then, Wj can be naturally viewed as the weight of contri- 
butions that the /'-th feature makes to the final prediction 
accuracy because our method is based on these 32 PCs. 
In general, the higher the weight Wj, the more contribu- 
tions the /-th feature makes. 

The contribution of each of the 1497 features is com- 
puted and depicted in Figure 4(A), where we can see 
that the contributions of the amino acid composition 
(AAC) in the first group of descriptors are much higher 
than those of the dipeptide composition (DC). There- 
fore, we separate the AAC from the DC in the first 
group of descriptors in the following discussions. In 
addition, we find that the features from the autocorrela- 
tion descriptors (AD) made the highest contributions 
among all the features. Because there are 1497 features, 
it is not convenient to discuss the contributions of all 
individuals one by one. Instead, we compute the average 
contributions of the features in the following six subsets: 
AAC, DC, AD, GD, SD, and PseAAC. Their results are 
shown in Figure 4(B). It is evident from the figure that 



P3.1 P3.2P3.3 






Figure 5 Contribution of features in the AD subset. The 

divisions of these eight groups are marked by vertical blue lines. 
Among the groups P3, P5 and P6, their divisions are marked by 
vertical red lines. 



the highest average contribution is obtained with the 
features in the PseAAC subset (0.1657). The slightly 
lower contributions are provided by the AD and AAC 
subsets (0.1595 and 0.1579, respectively). On the con- 
trary, the features in the GD and SD subsets achieve the 
average contributions only slightly higher than 0.14. The 
features in the DC subset instead achieve the least aver- 
age contribution (0.1092). In summary, if we arrange the 
features in the six subsets in a decreasing order of their 
average contributions, then we obtain PseAAC, AD, 
AAC, GD, SD, and DC. 

In particular, among the AD subset, some features 
made quite high contributions while the others made 
relatively low contributions, as we can see in Figure 4 
(A). For a thorough investigation, we plot the contribu- 
tions of all the features in the AD subset again in Figure 
5. These features are divided into eight groups according 
to the physicochemical properties used to compute 
them. In Figure 5, the eight groups of features are sepa- 
rated by vertical blue lines and indicated by PI, P2, 
P8, respectively. Note that Pi represents the i-th physi- 
cochemical property listed in Table 6. It is evident from 
Figure 5 that the highest contributions are due to the 
features computed with the physicochemical properties 
P3, P5 and P6; they are the polarizability parameter, 
residue accessible surface area in tripeptide and residue 
volume, respectively. For the group P3, we can further 
divide its 90 features into three subgroups according to 
three different autocorrelation descriptors (normalized 
Moreau-Broto, Moran, and Geary autocorrelation 
descriptors). These three subgroups are separated by red 
vertical lines in Figure 5, and indicated by P3.1, P3.2 
and P3.3, respectively. In each subgroup, the feature 
contributions are computed with the values of d varying 
from 1 to 30 (from left to right on the horizontal axis). 
Observe that Moran and Geary autocorrelation descrip- 
tors (P3.2 and P3.3) made much higher contributions 
than the normalized Moreau-Broto descriptors (P3.1). 
Furthermore, for Moran and Geary autocorrelation 
descriptors, the features that are computed with a value 
of d in the range from 20 to 30 generally give rise to a 
fairly high contribution, while the maximum is attained 
when d = 26. The similar characteristics can also be 
observed for the groups P5 and P6 from Figure 5. 

Additional material 



Additional file 1: The information about families, subfamilies, sub- 
subfamilies, and subtypes. The names of families, subfamilies, sub- 
subfamilies, and subtypes used by PCA-GPCR are listed in this file. The 
names are derived from GPCRDB database. The number of proteins in 
each family, subfamily, sub-subfamily, subtype, and the corresponding 
accuracies are also available in this file. 
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